Data integration with STACAS
Introduction
STACAS is a method for scRNA-seq data integration. It is based on the Seurat integration framework, and adds the following innovations:
- anchors are filtered/down-weighted based on their distance in reciprocal PCA space, calculated from the unscaled, normalized data
- integration trees are constructed based on the ‘centrality’ of datasets, as measured by the sum of their anchor (distance-weighted) scores
- cell labels, if known, can be given as input to the algorithm to perform semi-supervised integration
In this demo we will show the application of STACAS to integrate a collection of scRNA-seq datasets of immune cells from multiple donors, human tissues and studies, assembled by Luecken et al. for their excellent benchmark.
The data are available at: figshare/12420968
R environment
Get and load some useful packages
renv::restore()
if (!require("remotes", quietly = TRUE))
install.packages("remotes")
library(remotes)
install_github("carmonalab/STACAS", ref="dev")library(Seurat)
library(dplyr)
library(ggplot2)
library(STACAS)
seed = 1234
set.seed(seed)Load test datasets
Download the dataset of human immune cells assembled by Luecken et al., and convert them to Seurat objects.
download <- F
where <- 'aux'
dir.create(where, showWarnings = FALSE)
rds.path <- sprintf("%s/Immune_ALL_human.rds", where)
if(download){
options(timeout=500)
url <- "https://figshare.com/ndownloader/files/25717328"
h5.path <- sprintf("%s/Immune_ALL_human.rds", where)
download.file(url = url, destfile = h5.path)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require("zellkonverter", quietly = TRUE))
install.packages("zellkonverter") # to convert from h5ad to R object
#Convert to Seurat
object.sce <- zellkonverter::readH5AD(h5.path)
object <- Seurat::as.Seurat(object.sce, counts = "counts", data = "X")
object <- RenameAssays(object, originalexp="RNA")
rm (object.sce)
Idents(object) <- "final_annotation"
saveRDS(object = object,file = rds.path)
}else{
object <- readRDS(rds.path)
}Cell types were annotated by the authors on each dataset
individually, using a common dictionary of cell types (see https://github.com/theislab/scib-reproducibility). These
are stored in the final_annotation metadata column. Study
of origin is stored in batch metadaa
table(object$final_annotation, object$batch)[,1:5]##
## 10X Freytag Oetjen_A Oetjen_P Oetjen_U
## CD4+ T cells 2937 1238 577 295 1652
## CD8+ T cells 350 270 93 639 253
## CD10+ B cells 0 0 91 41 75
## CD14+ Monocytes 3388 452 350 263 384
## CD16+ Monocytes 364 25 61 26 78
## CD20+ B cells 1546 427 43 31 417
## Erythrocytes 0 0 186 1219 97
## Erythroid progenitors 0 0 112 249 102
## HSPCs 28 0 227 119 99
## Megakaryocyte progenitors 21 16 72 71 76
## Monocyte progenitors 0 0 183 109 136
## Monocyte-derived dendritic cells 182 0 89 60 65
## NK cells 756 476 26 0 63
## NKT cells 1056 432 389 62 157
## Plasma cells 18 0 33 40 38
## Plasmacytoid dendritic cells 81 11 54 41 38
How does the collection of datasets look without any integration?
Run a standard Seurat pipeline for dimensionality reduction
nfeatures <- 1000
ndim <- 20
object <- FindVariableFeatures(object, nfeatures = nfeatures) %>% NormalizeData() %>% ScaleData() %>% RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)p1_pre <- DimPlot(object, group.by = "batch") + theme(aspect.ratio = 1) + ggtitle("Dataset/batch before integration")
p2_pre <- DimPlot(object, group.by = "final_annotation", label=T, label.size = 4) + theme(aspect.ratio = 1) + ggtitle("Cell labels before integration")
p1_pre | p2_preAlthough cells mostly cluster by the cell type (as annotated in individual datasets), there are also visible batch effects (seen as dataset-specific clustering)
Let’s quantify dataset/batch mixing using the LISI metric from the Raychaudhuri Lab, and conservation of biological information by means of Average Silhouette Width (ASW) of cell labels in the corrected PCA space:
integrationMetrics <- list()
if (!require("scIntegrationMetrics", quietly = TRUE))
install_github("carmonalab/scIntegrationMetrics") #calculates LISI and Silhouette
library(scIntegrationMetrics)
lisi_perplexity <- 20
integrationMetrics[["uncorrected"]] <- list()
integrationMetrics[["uncorrected"]][["batch_LISI"]] <- mean(compute_lisi(object@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="batch", perplexity = lisi_perplexity)[[1]]) # perplexity=effective number of each cell's neighbors; use lower than default to make it faster
integrationMetrics[["uncorrected"]][["celltype_ASW"]] <- mean(compute_silhouette(object@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="final_annotation")[[1]])
integrationMetrics[["uncorrected"]]## $batch_LISI
## [1] 2.526888
##
## $celltype_ASW
## [1] 0.036829
We will now apply STACAS for correcting batch effects.
Standard integration
STACAS takes a list of Seurat objects as input, so let’s first split the merged object into a list of individual batches/datasets
obj.list <- SplitObject(object, split.by = "batch")STACAS Step 1: Find integration anchors
stacas_anchors <- FindAnchors.STACAS(obj.list,
anchor.features = nfeatures,
dims = 1:ndim)
## 1/10 2/10 3/10 4/10 5/10 6/10 7/10 8/10 9/10 10/10
## Finding integration anchors...STACAS step 2: Dataset integration
object_integrated <- IntegrateData.STACAS(stacas_anchors,
dims=1:ndim)Calculate low-dimensional embeddings and visualize integration results in UMAP
object_integrated <- object_integrated %>% ScaleData() %>% RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)p1_int <- DimPlot(object_integrated, group.by = "batch") + theme(aspect.ratio = 1) + ggtitle("Dataset/batch after integration")
p2_int <- DimPlot(object_integrated, group.by = "final_annotation", label=T, label.size = 2) + theme(aspect.ratio = 1) + ggtitle("Cell labels after integration")
p1_int | p2_intQuantify integration metrics:
integrationMetrics[["STACAS"]] <- list()
integrationMetrics[["STACAS"]][["batch_LISI"]] <- mean(compute_lisi(object_integrated@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="batch", perplexity = lisi_perplexity)[[1]])
integrationMetrics[["STACAS"]][["celltype_ASW"]] <- mean(compute_silhouette(object_integrated@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="final_annotation")[[1]])
integrationMetrics[["STACAS"]]## $batch_LISI
## [1] 3.039418
##
## $celltype_ASW
## [1] 0.2254194
Compared to the non-corrected data, we now observe: i) increase in dataset/batch LISI (i.e. effective number of different batches in a cell neighborhood), indicating a higher mixing ii) increase in Silhouette/ASW, indicating that cells with the same annotation (ie of the same kind) were brought into proximity
One-liner STACAS
The previous integration could have been run from the initial object with a single command:
object_integrated <- object %>% SplitObject(split.by = "batch") %>% Run.STACAS(dims = 1:ndim, anchor.features = nfeatures) %>% RunUMAP(dims = 1:ndim)
DimPlot(object_integrated, group.by = "batch")STACAS integration guide trees
Like Seurat, STACAS uses a guide tree to determine integration order. Optionally, you can check the integration guide tree automatically generated by STACAS (e.g. are samples from the same sequencing technology or from similar tissues clustering together?)
st1 <- SampleTree.STACAS(
anchorset = stacas_anchors,
obj.names = names(obj.list)
)You can tune or manually re-calculate this tree and pass it to
IntegrateData.STACAS:
STACAS step 2 with pre-calculated integration tree
object_integrated <- IntegrateData.STACAS(stacas_anchors,
dims=1:ndim,
sample.tree=st1
)Semi-supervised integration
When available, cell type annotations can be used to guide the alignment. STACAS will use this information to penalize anchors where cell types are inconsistent.
In this dataset, cells were annotated by the authors of the
benchmark. For your own data, you may want to do manual annotation or
apply one of several cell annotation tools, such as SingleR,
Garnett or scGate. We will be
posting some examples of cell type annotation using scGate
in a different demo.
** One-liner semi-supervised STACAS ** by indicating in
cell.labels the metadata column that contains cell
annotations
object_integrated_ss <- obj.list %>% Run.STACAS(dims = 1:ndim, anchor.features = nfeatures, cell.labels = "final_annotation" )Note that there is no need for ALL cells to be annotated: we
recommend to set labels to NA or unknown for cells
that cannot be confidently annotated, and they won’t be penalized for
label inconsistency. In addition, you can decide how much weight to give
to cell labels with the label.confidence parameter (from 0
to 1).
Visualize on UMAP space
object_integrated_ss <- object_integrated_ss %>% RunUMAP(dims=1:ndim)p1_ss <- DimPlot(object_integrated_ss, group.by = "batch") + theme(aspect.ratio = 1) + ggtitle("Dataset/batch after semi-supervised integration")
p2_ss <- DimPlot(object_integrated_ss, group.by = "final_annotation", label=T, label.size = 2) + theme(aspect.ratio = 1) + ggtitle("Cell labels after semi-supervised integration")
p1_ss | p2_ssQuantify integration metrics:
integrationMetrics[["semisupSTACAS"]] <- list()
integrationMetrics[["semisupSTACAS"]][["batch_LISI"]] <- mean(compute_lisi(object_integrated_ss@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="batch", perplexity = lisi_perplexity)[[1]])
integrationMetrics[["semisupSTACAS"]][["celltype_ASW"]] <- mean(compute_silhouette(object_integrated_ss@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="final_annotation")[[1]])
integrationMetrics[["semisupSTACAS"]]## $batch_LISI
## [1] 2.926707
##
## $celltype_ASW
## [1] 0.2812887
In this case, by using the semi-supervised approach we observe a gain in biological signal conservation, as measured by cell labels silhouette (celltype_ASW), with a similar dataset/batch mixing (batch_LISI) compared to the unsupervised integration.
Summary of Integration Metrics
if (!require("tidyr", quietly = TRUE))
install.packages("tidyr")
integrationMetricsSummary <- data.frame(unlist(integrationMetrics)) %>% tibble::rownames_to_column() %>% rename(value=unlist.integrationMetrics.) %>% separate(rowname, c("Method","Metric"), sep="\\.")
a <- integrationMetricsSummary %>% filter(Metric=="batch_LISI") %>% ggplot(aes(x=Method, y=value, fill=Method)) + geom_bar(stat="identity") + theme(legend.position="none") + ggtitle("batch_LISI")
b <- integrationMetricsSummary %>% filter(Metric=="celltype_ASW") %>% ggplot(aes(x=Method, y=value, fill=Method)) + geom_bar(stat="identity")+ theme(legend.position="none")+ ggtitle("celltype_ASW")
a | bNotes on data integration
The first critical step in batch effect correction is to ensure that the gene names in your datasets are harmonized. If different studies use different naming conventions, synonyms, etc., methods for dimensionality reduction will treat these genes as entirely different variables, and introducing artificial differences between datasets. It is recommended that your pre-processing pipeline takes care of resolving ambiguities in gene naming across datasets. You can use
STACAS:::standardizeGeneSymbolsbut there are other tools such as HGNChelperThe calculation of highly variable genes (HVG) is a fundamental step for dimensionality reduction and integration. We recommend excluding certain classes of genes, e.g. mitochondrial, ribosomal, heat-shock genes, from the list of HVG, as their expression are typically more strongly associated to technical variation than to actual biological differences. The
FindVariableGenes.STACAS()function allows providing a list of genes to exclude from HVG; see an example below.
We can use the collection of signatures stored in SignatuR package
if (!require("SignatuR", quietly = TRUE))
install_github("carmonalab/SignatuR")
library(SignatuR)
print(SignatuR$Hs)## levelName
## 1 Hs
## 2 ¦--Blocklists
## 3 ¦ ¦--Pseudogenes
## 4 ¦ ¦--HSP
## 5 ¦ °--Non-coding
## 6 ¦--Cell_types
## 7 ¦--Programs
## 8 ¦ ¦--cellCycle.G1S
## 9 ¦ ¦--cellCycle.G2M
## 10 ¦ °--IFN
## 11 °--Compartments
## 12 ¦--Mito
## 13 ¦--Ribo
## 14 ¦--TCR
## 15 °--Immunoglobulins
lapply(GetSignature(SignatuR$Hs),head) # retrieve full list of signatures and display ## $Pseudogenes
## ENSG00000256069 ENSG00000250420 ENSG00000240602 ENSG00000249038 ENSG00000234969
## "A2MP1" "AACSP1" "AADACP1" "AARS1P1" "AARSD1P1"
## ENSG00000257408
## "ABCA3P1"
##
## $HSP
## ENSG00000179941 ENSG00000181004 ENSG00000120438 ENSG00000166226 ENSG00000163468
## "BBS10" "BBS12" "TCP1" "CCT2" "CCT3"
## ENSG00000115484
## "CCT4"
##
## $`Non-coding`
## ENSG00000268895 ENSG00000245105 ENSG00000256661 ENSG00000256904 ENSG00000242908
## "A1BG-AS1" "A2M-AS1" "A2ML1-AS1" "A2ML1-AS2" "AADACL2-AS1"
## ENSG00000215458
## "AATBC"
##
## $cellCycle.G1S
## ENSG00000156802 ENSG00000197299 ENSG00000136492 ENSG00000288475 ENSG00000175305
## "ATAD2" "BLM" "BRIP1" "CASP8AP2" "CCNE2"
## ENSG00000093009
## "CDC45"
##
## $cellCycle.G2M
## ENSG00000011426 ENSG00000143401 ENSG00000087586 ENSG00000178999 ENSG00000089685
## "ANLN" "ANP32E" "AURKA" "AURKB" "BIRC5"
## ENSG00000169679
## "BUB1"
##
## $IFN
## ENSG00000282851 ENSG00000254838 ENSG00000254781 ENSG00000126709 ENSG00000163565
## "BISPR" "GVINP1" "GVINP2" "IFI6" "IFI16"
## ENSG00000165949
## "IFI27"
##
## $Mito
## ENSG00000198899 ENSG00000228253 ENSG00000198804 ENSG00000198712 ENSG00000198938
## "MT-ATP6" "MT-ATP8" "MT-CO1" "MT-CO2" "MT-CO3"
## ENSG00000198727
## "MT-CYB"
##
## $Ribo
## ENSG00000147403 ENSG00000198755 ENSG00000165496 ENSG00000142676 ENSG00000197958
## "RPL10" "RPL10A" "RPL10L" "RPL11" "RPL12"
## ENSG00000167526
## "RPL13"
##
## $TCR
## ENSG00000102871 ENSG00000211888 ENSG00000211879 ENSG00000211878 ENSG00000211877
## "TRADD" "TRAJ1" "TRAJ10" "TRAJ11" "TRAJ12"
## ENSG00000211876
## "TRAJ13"
##
## $Immunoglobulins
## ENSG00000211895 ENSG00000211890 ENSG00000211898 ENSG00000236170 ENSG00000237197
## "IGHA1" "IGHA2" "IGHD" "IGHD1-1" "IGHD1-7"
## ENSG00000227108
## "IGHD1-14"
Then we tell STACAS to exclude specific genes from HVG used in
integration, using the parameter genesBlockList
my.genes.blocklist <- c(GetSignature(SignatuR$Hs$Blocklists),GetSignature(SignatuR$Hs$Cell_types,SignatuR$Hs$Compartments))
object_integrated_blockList <- Run.STACAS(obj.list, genesBlockList = my.genes.blocklist, dims = 1:ndim, anchor.features = nfeatures)sessionInfo()## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] SignatuR_0.0.2 tidyr_1.2.0 scIntegrationMetrics_0.1
## [4] STACAS_2.0.0 ggplot2_3.3.6 dplyr_1.0.9
## [7] sp_1.5-0 SeuratObject_4.1.0 Seurat_4.1.1
## [10] remotes_2.4.2
##
## loaded via a namespace (and not attached):
## [1] plyr_1.8.7 igraph_1.3.4 lazyeval_0.2.2
## [4] splines_4.2.0 listenv_0.8.0 scattermore_0.8
## [7] digest_0.6.29 htmltools_0.5.3 fansi_1.0.3
## [10] magrittr_2.0.3 tensor_1.5 cluster_2.1.3
## [13] ROCR_1.0-11 globals_0.15.1 matrixStats_0.62.0
## [16] spatstat.sparse_2.1-1 rmdformats_1.0.4 prettyunits_1.1.1
## [19] colorspace_2.0-3 ggrepel_0.9.1 xfun_0.31
## [22] callr_3.7.1 crayon_1.5.1 jsonlite_1.8.0
## [25] progressr_0.10.1 spatstat.data_2.2-0 survival_3.3-1
## [28] zoo_1.8-10 glue_1.6.2 polyclip_1.10-0
## [31] gtable_0.3.0 leiden_0.4.2 pkgbuild_1.3.1
## [34] future.apply_1.9.0 abind_1.4-5 scales_1.2.0
## [37] data.tree_1.0.0 spatstat.random_2.2-0 miniUI_0.1.1.1
## [40] Rcpp_1.0.9 viridisLite_0.4.0 xtable_1.8-4
## [43] reticulate_1.25 spatstat.core_2.4-4 htmlwidgets_1.5.4
## [46] httr_1.4.3 RColorBrewer_1.1-3 ellipsis_0.3.2
## [49] ica_1.0-3 pkgconfig_2.0.3 farver_2.1.1
## [52] sass_0.4.2 uwot_0.1.11 deldir_1.0-6
## [55] utf8_1.2.2 tidyselect_1.1.2 labeling_0.4.2
## [58] rlang_1.0.4 reshape2_1.4.4 later_1.3.0
## [61] munsell_0.5.0 tools_4.2.0 cachem_1.0.6
## [64] cli_3.3.0 generics_0.1.3 ggridges_0.5.3
## [67] evaluate_0.15 stringr_1.4.0 fastmap_1.1.0
## [70] yaml_2.3.5 goftest_1.2-3 processx_3.7.0
## [73] knitr_1.39 fitdistrplus_1.1-8 purrr_0.3.4
## [76] RANN_2.6.1 pbapply_1.5-0 future_1.27.0
## [79] nlme_3.1-158 mime_0.12 compiler_4.2.0
## [82] rstudioapi_0.13 plotly_4.10.0 curl_4.3.2
## [85] png_0.1-7 spatstat.utils_2.3-1 tibble_3.1.8
## [88] bslib_0.4.0 stringi_1.7.8 highr_0.9
## [91] ps_1.7.1 RSpectra_0.16-1 rgeos_0.5-9
## [94] lattice_0.20-45 Matrix_1.4-1 vegan_2.6-2
## [97] permute_0.9-7 vctrs_0.4.1 pillar_1.8.0
## [100] lifecycle_1.0.1 BiocManager_1.30.18 spatstat.geom_2.4-0
## [103] lmtest_0.9-40 jquerylib_0.1.4 RcppAnnoy_0.0.19
## [106] data.table_1.14.2 cowplot_1.1.1 irlba_2.3.5
## [109] httpuv_1.6.5 patchwork_1.1.1 R6_2.5.1
## [112] bookdown_0.27 promises_1.2.0.1 renv_0.15.5
## [115] KernSmooth_2.23-20 gridExtra_2.3 parallelly_1.32.1
## [118] codetools_0.2-18 MASS_7.3-56 rprojroot_2.0.3
## [121] withr_2.5.0 sctransform_0.3.3 mgcv_1.8-40
## [124] parallel_4.2.0 grid_4.2.0 rpart_4.1.16
## [127] rmarkdown_2.14 Rtsne_0.16 shiny_1.7.2
Further reading
The STACAS package and installation instructions are available at: STACAS package
The code for this demo can be found on GitHub
References
Andreatta A., Carmona S. J. (2021). STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data. - Bioinformatics
Luecken, M. D., Büttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Müller, M. F., … & Theis, F. J. (2022). Benchmarking atlas-level data integration in single-cell genomics. - Nature methods
Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W. M., Zheng, S., Butler, A., … & Satija, R. (2021). Integrated analysis of multimodal single-cell data. - Cell